
    The Maximum Exposure Problem

    Given a set of points P and a set of axis-aligned rectangles R in the plane, a point p in P is called exposed if it lies outside all rectangles in R. In the max-exposure problem, given an integer parameter k, we want to delete k rectangles from R so as to maximize the number of exposed points. We show that the problem is NP-hard and, assuming plausible complexity conjectures, also hard to approximate even when the rectangles in R are translates of two fixed rectangles. However, if R consists only of translates of a single rectangle, we present a polynomial-time approximation scheme. For general rectangle range spaces, we present a simple O(k) bicriteria approximation algorithm; that is, by deleting O(k^2) rectangles, we can expose at least Omega(1/k) of the optimal number of points.
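
    The objective is easy to state directly in code. Below is a minimal brute-force sketch (a toy, not the paper's PTAS or bicriteria algorithm) that, for a small instance, tries every size-k deletion set and counts the exposed points; all names and the rectangle encoding (x1, y1, x2, y2) are illustrative assumptions.

        from itertools import combinations

        def inside(p, r):
            # r = (x1, y1, x2, y2), axis-aligned; p = (x, y)
            x, y = p
            x1, y1, x2, y2 = r
            return x1 <= x <= x2 and y1 <= y <= y2

        def exposed_count(points, rects, deleted):
            # Count points lying outside every rectangle that was not deleted.
            remaining = [r for i, r in enumerate(rects) if i not in deleted]
            return sum(all(not inside(p, r) for r in remaining) for p in points)

        def max_exposure_bruteforce(points, rects, k):
            # Try every size-k deletion set; exponential in k, illustration only.
            return max((exposed_count(points, rects, set(d)), d)
                       for d in combinations(range(len(rects)), k))

        if __name__ == "__main__":
            P = [(1, 1), (3, 3), (5, 5)]
            R = [(0, 0, 2, 2), (2, 2, 4, 4), (0, 0, 6, 6)]
            print(max_exposure_bruteforce(P, R, 1))  # deleting the big rectangle exposes (5, 5)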

    JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing

    Approximate query processing over dynamic databases, i.e., databases under insertions and deletions, has applications ranging from high-frequency trading to internet-of-things analytics. We present JanusAQP, a new dynamic AQP system, which supports SUM, COUNT, AVG, MIN, and MAX queries under insertions and deletions to the dataset. JanusAQP extends static partition tree synopses, which are hierarchical aggregations of datasets, into the dynamic setting. This paper contributes new methods for: (1) efficient initialization of the data synopsis in the presence of incoming data, (2) maintenance of the data synopsis under insertions/deletions, and (3) re-optimization of the partitioning to reduce the approximation error. JanusAQP reduces the error of a state-of-the-art baseline by more than 60% using only 10% of the storage cost. JanusAQP can process more than 100K updates per second in a single-node setting while keeping query latency at the millisecond level.
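
    As a rough illustration of the kind of synopsis being maintained, the sketch below keeps a flat one-dimensional partition with exact per-bucket COUNT/SUM that is cheap to update under insertions and deletions, and answers approximate range-SUM queries by assuming uniformity inside partially covered buckets. This is an assumption-laden toy with illustrative names, not JanusAQP's hierarchical partition tree, and it omits MIN/MAX maintenance and re-optimization.

        import bisect

        class PartitionSynopsis:
            def __init__(self, edges):
                # edges: sorted bucket boundaries, e.g. [0, 10, 20, 30]
                self.edges = edges
                self.count = [0] * (len(edges) - 1)
                self.total = [0.0] * (len(edges) - 1)

            def _bucket(self, x):
                b = bisect.bisect_right(self.edges, x) - 1
                return max(0, min(b, len(self.count) - 1))

            def insert(self, x):
                b = self._bucket(x)
                self.count[b] += 1
                self.total[b] += x

            def delete(self, x):
                b = self._bucket(x)
                self.count[b] -= 1
                self.total[b] -= x

            def approx_sum(self, lo, hi):
                # Fully covered buckets contribute exactly; partially covered
                # buckets contribute in proportion to the overlap (uniformity).
                s = 0.0
                for b in range(len(self.count)):
                    left, right = self.edges[b], self.edges[b + 1]
                    overlap = max(0.0, min(hi, right) - max(lo, left))
                    if overlap > 0:
                        s += self.total[b] * overlap / (right - left)
                return s

        if __name__ == "__main__":
            syn = PartitionSynopsis([0, 10, 20, 30])
            for v in (1, 5, 12, 18, 25):
                syn.insert(v)
            syn.delete(12)
            print(syn.approx_sum(5, 25))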

    A Fair and Memory/Time-efficient Hashmap

    There is a large amount of work constructing hashmaps to minimize the number of collisions. However, to the best of our knowledge, no known hashing technique guarantees group fairness among different groups of items. We are given a set $P$ of $n$ tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of groups $\mathcal{G}=\{\mathbf{g}_1,\ldots,\mathbf{g}_k\}$ such that every tuple belongs to a unique group. We formally define the fair hashing problem, introducing the notions of single fairness ($\Pr[h(p)=h(x)\mid p\in \mathbf{g}_i,\, x\in P]$ for every $i=1,\ldots,k$), pairwise fairness ($\Pr[h(p)=h(q)\mid p,q\in \mathbf{g}_i]$ for every $i=1,\ldots,k$), and the well-known collision probability ($\Pr[h(p)=h(q)\mid p,q\in P]$). The goal is to construct a hashmap such that the collision probability, the single fairness, and the pairwise fairness are all close to $1/m$, where $m$ is the number of buckets in the hashmap. We propose two families of algorithms for designing fair hashmaps. First, we focus on hashmaps with optimum memory consumption that minimize the unfairness. We model the input tuples as points in $\mathbb{R}^d$, and the goal is to find a vector $w$ such that the projection of $P$ onto $w$ creates an ordering that is convenient to split in order to create a fair hashmap. For each projection we design efficient algorithms that find near-optimum partitions of exactly (or at most) $m$ buckets. Second, we focus on hashmaps with optimum fairness ($0$-unfairness) that minimize the memory consumption. We make the important observation that the fair hashmap problem reduces to the necklace splitting problem. By carefully implementing algorithms for solving the necklace splitting problem, we propose faster algorithms constructing hashmaps with $0$-unfairness using $2(m-1)$ boundary points when $k=2$ and $k(m-1)(4+\log_2(3mn))$ boundary points for $k>2$.
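
    The three quantities defined above can be estimated empirically for any hash function. The sketch below (illustrative names, exact enumeration over all pairs, so suitable only for tiny inputs) measures the collision probability, the per-group single fairness, and the per-group pairwise fairness of a hash function h with m buckets; it is a way to check the definitions, not one of the construction algorithms from the paper.

        def fairness_stats(items, groups, h, m):
            # items: list of hashable tuples; groups: parallel list of group labels
            buckets = [h(x) % m for x in items]
            n = len(items)

            def pair_collision(idx_a, idx_b):
                # Empirical Pr[h(p) = h(q)] for p from idx_a, q from idx_b, p != q
                pairs = [(i, j) for i in idx_a for j in idx_b if i != j]
                return sum(buckets[i] == buckets[j] for i, j in pairs) / len(pairs)

            all_idx = list(range(n))
            collision = pair_collision(all_idx, all_idx)
            single, pairwise = {}, {}
            for g in set(groups):
                idx = [i for i in all_idx if groups[i] == g]
                single[g] = pair_collision(idx, all_idx)  # p in group g, x in P
                pairwise[g] = pair_collision(idx, idx)    # p, q both in group g
            return collision, single, pairwise

        if __name__ == "__main__":
            pts = [(i, 2 * i) for i in range(20)]
            grp = ["a" if i < 10 else "b" for i in range(20)]
            # All three statistics should ideally be close to 1/m = 0.25 here.
            print(fairness_stats(pts, grp, hash, m=4))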

    Computing Shortest Paths in the Plane with Removable Obstacles

    We consider the problem of computing a Euclidean shortest path in the presence of removable obstacles in the plane. In particular, we have a collection of pairwise-disjoint polygonal obstacles, each of which may be removed at some cost c_i > 0. Given a cost budget C > 0 and a pair of points s, t, which obstacles should be removed to minimize the path length from s to t in the remaining workspace? We show that this problem is NP-hard even if the obstacles are vertical line segments. Our main result is a fully polynomial-time approximation scheme (FPTAS) for the case of convex polygons. Specifically, we compute a (1 + epsilon)-approximate shortest path in time O((nh/epsilon^2) log n log(n/epsilon)) with removal cost at most (1 + epsilon)C, where h is the number of obstacles, n is the total number of obstacle vertices, and epsilon in (0, 1) is a user-specified parameter. Our approximation scheme also solves a shortest path problem for a stochastic model of obstacles, where each obstacle's presence is an independent event with a known probability. Finally, we also present a data structure that can answer s-t path queries in polylogarithmic time, for any pair of points s, t in the plane.
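
    To make the budgeted removal trade-off concrete, here is a toy discretization (an assumption-heavy sketch with illustrative names, not the paper's FPTAS and not a continuous Euclidean shortest path): obstacles are sets of blocked grid cells with removal costs, and every obstacle subset within the budget C is tried with a breadth-first search.

        from collections import deque
        from itertools import combinations

        def bfs_len(blocked, size, s, t):
            # Shortest 4-connected path length on a size x size grid, or None.
            if s in blocked or t in blocked:
                return None
            dist, q = {s: 0}, deque([s])
            while q:
                cell = q.popleft()
                if cell == t:
                    return dist[cell]
                x, y = cell
                for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                            and nxt not in blocked and nxt not in dist):
                        dist[nxt] = dist[cell] + 1
                        q.append(nxt)
            return None

        def best_removal(obstacles, costs, size, s, t, budget):
            # obstacles: list of sets of blocked cells; costs: parallel removal costs.
            best_len, best_set = None, None
            for r in range(len(obstacles) + 1):
                for rem in combinations(range(len(obstacles)), r):
                    if sum(costs[i] for i in rem) > budget:
                        continue
                    blocked = set()
                    for i, cells in enumerate(obstacles):
                        if i not in rem:
                            blocked |= cells
                    d = bfs_len(blocked, size, s, t)
                    if d is not None and (best_len is None or d < best_len):
                        best_len, best_set = d, rem
            return best_len, best_set

        if __name__ == "__main__":
            wall = {(2, y) for y in range(5)}  # a vertical wall of blocked cells
            print(best_removal([wall], [3.0], 5, (0, 2), (4, 2), budget=3.0))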

    Computing Data Distribution from Query Selectivities

    We are given a set $\mathcal{Z}=\{(R_1,s_1),\ldots,(R_n,s_n)\}$, where each $R_i$ is a range in $\Re^d$, such as a rectangle or a ball, and $s_i\in[0,1]$ denotes its selectivity. The goal is to compute a small-size discrete data distribution $\mathcal{D}=\{(q_1,w_1),\ldots,(q_m,w_m)\}$, where $q_j\in\Re^d$ and $w_j\in[0,1]$ for each $1\leq j\leq m$, and $\sum_{1\leq j\leq m}w_j=1$, such that $\mathcal{D}$ is the most consistent with $\mathcal{Z}$, i.e., $\mathrm{err}_p(\mathcal{D},\mathcal{Z})=\frac{1}{n}\sum_{i=1}^n \lvert s_i-\sum_{j=1}^m w_j\cdot 1(q_j\in R_i)\rvert^p$ is minimized. In a database setting, $\mathcal{Z}$ corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and $\mathcal{D}$ can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is $\mathsf{NP}$-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time $O((n+\delta^{-d})\delta^{-2}\,\mathrm{polylog})$, a discrete distribution $\tilde{\mathcal{D}}$ of size $O(\delta^{-2})$, such that $\mathrm{err}_p(\tilde{\mathcal{D}},\mathcal{Z})\leq \min_{\mathcal{D}}\mathrm{err}_p(\mathcal{D},\mathcal{Z})+\delta$ (for $p=1,2,\infty$), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations, as well as of removing the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
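
    The consistency objective can be evaluated directly for any candidate distribution. The sketch below does so for axis-aligned rectangular ranges in the plane and finite p; the names and the rectangle encoding are illustrative assumptions, and for p = infinity the definition would take a maximum rather than an average.

        def err_p(dist, workload, p=1):
            # dist: list of ((x, y), w) weighted points, with weights summing to 1
            # workload: list of ((x1, y1, x2, y2), s) ranges with observed selectivities
            def contains(rect, q):
                x1, y1, x2, y2 = rect
                return x1 <= q[0] <= x2 and y1 <= q[1] <= y2

            total = 0.0
            for rect, s in workload:
                est = sum(w for q, w in dist if contains(rect, q))
                total += abs(s - est) ** p
            return total / len(workload)

        if __name__ == "__main__":
            D = [((1, 1), 0.5), ((3, 3), 0.5)]
            Z = [((0, 0, 2, 2), 0.4), ((0, 0, 4, 4), 1.0)]
            print(err_p(D, Z, p=1))  # (|0.4 - 0.5| + |1.0 - 1.0|) / 2 = 0.05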

    Efficient Algorithms for k-Regret Minimizing Sets

    A regret minimizing set Q is a small-size representation of a much larger database P, so that user queries executed on Q return answers whose scores are not much worse than those on the full dataset. In particular, a k-regret minimizing set has the property that the regret ratio between the score of the top-1 item in Q and the score of the top-k item in P is minimized, where the score of an item is the inner product of the item's attributes with a user's weight (preference) vector. The problem is challenging because we want to find a single representative set Q whose regret ratio is small with respect to all possible user weight vectors. We show that k-regret minimization is NP-complete for all dimensions d >= 3, settling an open problem from Chester et al. [VLDB 2014]. Our main algorithmic contributions are two approximation algorithms, both with provable guarantees, one based on coresets and another based on hitting sets. We perform an extensive experimental evaluation of our algorithms, using both real-world and synthetic data, and compare their performance against the solution proposed in [VLDB 2014]. The results show that our algorithms are significantly faster and scale to much larger sets than the greedy algorithm of Chester et al. for answers of comparable quality.
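
    The objective being minimized can be spot-checked numerically. The sketch below estimates the k-regret ratio of a candidate subset Q against the full dataset P by sampling random nonnegative weight vectors; it is a Monte Carlo check of the definition with illustrative names, not the coreset or hitting-set algorithms from the paper.

        import random

        def kregret_ratio(P, Q, k, trials=1000, seed=0):
            # Estimate the worst-case k-regret ratio of Q over sampled weight vectors.
            rng = random.Random(seed)
            d = len(P[0])
            worst = 0.0
            for _ in range(trials):
                w = [rng.random() for _ in range(d)]
                scores = sorted((sum(wi * xi for wi, xi in zip(w, x)) for x in P),
                                reverse=True)
                top_k = scores[k - 1]                                          # top-k score in P
                top_1 = max(sum(wi * xi for wi, xi in zip(w, x)) for x in Q)   # top-1 score in Q
                if top_k > 0:
                    worst = max(worst, max(0.0, (top_k - top_1) / top_k))
            return worst

        if __name__ == "__main__":
            P = [(1, 9), (3, 7), (5, 5), (7, 3), (9, 1)]
            Q = [(1, 9), (9, 1)]
            print(kregret_ratio(P, Q, k=2))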

    Approximating Distance Measures for the Skyline

    In multi-parameter decision making, data is usually modeled as a set of points whose dimension is the number of parameters, and the skyline or Pareto points represent the possible optimal solutions for various optimization problems. The structure and computation of such points have been well studied, particularly in the database community. As the skyline can be quite large in high dimensions, one often seeks a compact summary. In particular, for a given integer parameter k, a subset of k points is desired which best approximates the skyline under some measure. Various measures have been proposed, but they mostly treat the skyline as a discrete object. By viewing the skyline as a continuous geometric hull, we propose a new measure that evaluates the quality of a subset by the Hausdorff distance of its hull to the full hull. We argue that in many ways our measure more naturally captures what it means to approximate the skyline. For our new geometric skyline approximation measure, we provide a plethora of results. Specifically, we provide (1) a near-linear time exact algorithm in two dimensions, (2) APX-hardness results for dimensions three and higher, (3) approximation algorithms for related variants of our problem, and (4) a practical and efficient heuristic which uses our geometric insights into the problem, as well as various experimental results to show the efficacy of our approach.
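
    The paper's measure is defined on the continuous hulls of the skyline and of the chosen subset; as a crude discrete stand-in with illustrative names, the sketch below computes a 2D max-skyline and the Hausdorff distance between the full skyline point set and a k-point subset of it.

        import math

        def skyline(points):
            # 2D max-skyline: points not dominated in both coordinates by another point.
            return [p for p in points
                    if not any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)]

        def hausdorff(A, B):
            def directed(X, Y):
                return max(min(math.dist(x, y) for y in Y) for x in X)
            return max(directed(A, B), directed(B, A))

        if __name__ == "__main__":
            pts = [(1, 9), (3, 7), (5, 5), (7, 3), (9, 1), (2, 2)]
            sky = skyline(pts)
            subset = [sky[0], sky[-1]]       # keep only the two extreme skyline points
            print(hausdorff(sky, subset))    # how badly the subset approximates the skyline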

    Dynamic Enumeration of Similarity Joins
